NHL goalies are kind of an enigma. They’re notoriously hard to evaluate and predict in a sport that is already so chaotic. They are also incredibly important to their teams and often you will hear hockey analysts citing teams’ goalies as the reason that the team is preforming well or poorly.
We are interested specifically in evaluating NHL goalies for the purpose of understanding their contracts. How do teams decide how much to pay their goalies? We want to inspect goalies in the post-lockout seasons and take a look at the best, the worst and how much they get paid.
The objective of this report is to evaluate NHL goalies and their contracts in the post-lockout era and then build a model to try and predict how much goalies will get paid on their next contract.
We wanted to specifically preform this analysis during the post-lockout era (2013-present). Hockey has changed so much in these seasons, comparing goalies from post-lockout era hockey to pre-lockout doesn’t quite make sense.
How many unique goalies played per season?
hr_data %>%
distinct(player, szn) %>%
count(szn, name = 'Number of goalies') %>%
rename(Season = szn) %>%
kable() %>%
kable_styling(bootstrap_options = c('hover'), full_width = F)
| Season | Number of goalies |
|---|---|
| 12-13 | 82 |
| 13-14 | 97 |
| 14-15 | 92 |
| 15-16 | 92 |
| 16-17 | 95 |
| 17-18 | 95 |
| 18-19 | 93 |
We see that ~95 goalies played each season. This may seem like a lot but this actually makes sense. From 2013-2017 there were 30 NHL teams in the league and from 2017-2019 there were 31. Each NHL team usually has a starting goalie, a back-up goalie, and a third-string “emergency” goalie. Teams calling-up goalies from their farm teams in emergency situations is also common. So you would (assuming that injuries are uniform) that number of teams in league * 3 is the approximate number of goalies who play each season.
In total there were 187 goalies that played at least 1 game in the NHL from 2013-2019. However, it makes no sense to evaluate a goalie based on their performance in just a few games because of the variation between events occuring in each hockey game. One of the many mysteries surrounding goalies is after how many games can we draw definitive conclusions about their quality of play? Let’s look at how many games goalies were playing per season and over their whole careers.
gp_szn <-
ggplot(hr_data, aes(gp)) +
geom_histogram(binwidth = 5, fill = '#a7d2cb', alpha = 0.8) +
labs(x = 'Games played per season by goalies', y = 'Frequency') +
theme(plot.caption = element_text(color = 'grey50'))
gp_overall <-
hr_data %>%
group_by(player) %>%
summarise(gp = sum(gp)) %>%
ggplot(aes(gp)) +
geom_histogram(binwidth = 25, fill = 'thistle', alpha = 0.8) +
labs(x = 'Games played overall by goalies', y = 'Frequency',
caption = '1 Game played = 1 or more minute played on ice') +
theme(plot.caption = element_text(color = 'grey50'))
gp_szn + gp_overall
In the case of this analysis, we were interested specifically in goalies who had played 40 or more games in the NHL from 2013-2019. Our reasoning behind this is that a full NHL season is 82 games so 40 games is about half a “season” played and seems like a reasonably large enough sample to evaluate a goalie’s play. The one downside to this evaluation is that it will devalue the contributions of back-up goalies, but back-up goalies usually don’t last as long as starters anyway and thus, their contracts tend to stay around league minimum.
Fact: In 2018-19, between goalies getting constantly blown out and injured, the Philadelphia Flyers were infamous for their goaltending trouble. They iced 8 goalies during the regular season, the highest of any team post-lockout.
read_csv('data/flyers_goaliedata_19.csv') %>%
clean_names() %>%
arrange(desc(gp)) %>%
select(Player = player, Team = tm, `Games played` = gp, `Save %` = sv_percent) %>%
kable() %>%
kable_styling(bootstrap_options = c('condensed', 'hover'), full_width = F)
| Player | Team | Games played | Save % |
|---|---|---|---|
| Carter Hart | PHI | 31 | 0.917 |
| Brian Elliott | PHI | 26 | 0.907 |
| Anthony Stolarz | PHI | 12 | 0.902 |
| Calvin Pickard | PHI | 11 | 0.863 |
| Michal Neuvirth | PHI | 7 | 0.859 |
| Cam Talbot | PHI | 4 | 0.881 |
| Alex Lyon | PHI | 2 | 0.806 |
| Mike McKenna | PHI | 1 | 0.833 |
It is well-known that most hockey players origniate from North America, Russia, or Scandinavia. Is this true for goalies specifically?
goalie_countries <-
cf_data %>%
count(country, name = "num_of_goalies") %>%
na.omit()
world <-
ne_countries(
scale = "medium",
returnclass = "sf",
continent = c('North America', 'Europe')
)
world %>%
left_join(goalie_countries, by = c('name' = 'country')) %>%
ggplot() +
geom_sf(aes(fill = num_of_goalies)) +
labs(fill = "Number of goalies") +
scale_fill_fish(option = "Prionace_glauca", direction = -1) +
theme_void() +
theme(legend.position = "bottom",
legend.key.height = unit(2, 'mm'),
legend.text = element_text(size = 7),
legend.title = element_text(size = 8))
We can see that this holds true for goalies for the most part, but it’s interesting to note that there are goalies who originate from Denmark and even the UK.
The old hockey idea is that goalies are supposed to be big. The taller and wider they are, the more space they take up in the net. However, with the evolution of hockey skill came the need for more athletic goalies with lightning reflexes. The butterfly style, now commonplace in the NHL, requires goalies to be flexible. So what do goalies look like in the NHL now?
cf_data_player_height_weight <-
cf_data %>%
select(player, weight, height, age, country) %>%
mutate(
weight = parse_number(sub(".*-","", weight)),
height = parse_number(sub(".*-","",height))
) %>%
mutate(age = case_when(
age < 20 ~ 'Under 20',
age >= 20 & age < 25 ~ '20-24',
age >= 25 & age < 30 ~ '25-29',
age >= 30 & age < 35 ~ '30-34',
age >= 35 & age < 40 ~ '35-39',
age > 40 ~ 'Over 40',
))
p <-
cf_data_player_height_weight %>%
ggplot(aes(height, weight)) +
geom_jitter(aes(color = age, label = player, label2 = country),
alpha = 0.65, size = 3) +
labs(x = 'Goalies\'s Height (cm)',
y = 'Goalies\'s Weight (kg)',
color = 'Age') +
scale_color_fish_d(option = 'Callanthias_australis')
ggplotly(p)
So we can see that the largest chunck of goalies seems to fall between 183 - 193cm (6’0 - 6’3 ft) in height and 85 - 95kg (187 - 210lbs) in weight. So only slightly taller and heavier than the average adult male, which is what you would expect from a professional athelete but is not quite what you would expect if you tend to think of goalies as big players.
hr_data <- hr_data %>% mutate(w_percent = w/gp)
Salaries <- cf_data %>%
group_by(player) %>%
summarise(mean_aav = mean(aav)) %>%
mutate(player = replace(player,player == "Marc-André Fleury","Marc-Andre Fleury")) %>%
mutate(player = replace(player,player == "Jaroslav Halák","Jaroslav Halak")) %>%
mutate(player = replace(player, player == "Eddie Läck","Eddie Lack")) %>%
mutate(player = replace(player, player == "Jacob Markström","Jacob Markstrom")) %>%
mutate(player = replace(player, player == "Petr Mrázek","Petr Mrazek"))
Best_Goalies <-
hr_data %>%
filter(!(player == "Martin Jones" & szn == "14-15")) %>%
group_by(player) %>%
summarise(gp = sum(gp),
w = sum(w),
ga = sum(ga),
sa = sum(sa),
sv = sum(sv),
) %>%
filter(gp > 120) %>%
mutate(mean_w_percent = w/gp,
mean_sv_percent = sv/sa,
mean_gaa = ga/gp,
avg_sv = mean(mean_sv_percent),
mean_gsaa = (sa * (1-avg_sv))- ga,
)
Best_Goalies <- Best_Goalies %>%
left_join(Salaries, by = c("player" = "player"))
Mean_sv_plot <-
Best_Goalies %>%
top_n(10, mean_sv_percent) %>%
ggplot(aes(x = mean_sv_percent, y = mean_aav)) +
geom_point(size = 2, color = '#a7d2cb') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Save Percentage", y = "Average Annual Salary")
Mean_gaa_plot <-
Best_Goalies %>%
top_n(-10, mean_gaa) %>%
ggplot(aes(x = mean_gaa, y = mean_aav)) +
geom_point(size = 2, color = '#f2d388') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Against Average", y = "Average Annual Salary")
mean_w_percent_plot <-
Best_Goalies %>%
top_n(10, mean_w_percent) %>%
ggplot(aes(x = mean_w_percent, y = mean_aav)) +
geom_point(size = 2, color = '#c98474') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Percentage of Games Won", y = "Average Annual Salary")
mean_gsaa_plot <-
Best_Goalies %>%
top_n(10, mean_gsaa) %>%
ggplot(aes(x = mean_gsaa,y = mean_aav))+
geom_point(size = 2, color = '#874c62') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Saved Above Average", y = "Average Annual Salary")
(Mean_sv_plot | Mean_gaa_plot) / (mean_w_percent_plot | mean_gsaa_plot)
# Finding the best goalies by Season
# creates tables for each stat we are looking at
top_goalie_sv <- function(season){
top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40)
top_goalie_sv_percent <- top_goalie_stats %>% top_n(1,sv_percent) %>%
select(player, sv_percent, szn)
return(top_goalie_sv_percent)
}
top_goalie_gaa <- function(season){
top_goalie_stats <- hr_data %>%
filter(szn == season, gp > 40)
goalie_gaa <- top_goalie_stats %>%
top_n(-1,gaa) %>%
select(player, gaa, szn)
return(goalie_gaa)
}
top_goalie_w_percent <- function(season){
top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40)
goalie_w_percent <- top_goalie_stats %>% top_n(1,w_percent) %>%
select(player, w_percent, szn)
return(goalie_w_percent)
}
top_goalie_gsaa <- function(season){
top_goalie_stats <- hr_data %>% filter(szn == season, gp > 40)
goalie_gsaa <- top_goalie_stats %>% top_n(1,gsaa) %>%
select(player, gsaa, szn)
return(goalie_gsaa)
}
#Loop to fill in the tables
best_sv_year <- tibble()
best_gaa_year <- tibble()
best_win_percent_year <- tibble()
best_gsaa_year <- tibble()
for (i in 1:7){
best_sv_year <- bind_rows(best_sv_year,top_goalie_sv(hr_szns[i]))
best_gaa_year <- bind_rows(best_gaa_year,top_goalie_gaa(hr_szns[i]))
best_win_percent_year <- bind_rows(best_win_percent_year,top_goalie_w_percent(hr_szns[i]))
best_gsaa_year <- bind_rows(best_gsaa_year,top_goalie_gsaa(hr_szns[i]))
}
#Plots to see best goalies by season
sv_plot <-
best_sv_year %>%
ggplot(aes(x = szn, y = sv_percent)) +
geom_point(size = 2, color = '#a7d2cb') +
geom_text_repel(aes(label = player), size = 3) +
coord_flip() +
labs(x = "Season", y = "Save Percentage")
gaa_plot <-
best_gaa_year %>%
ggplot(aes(x =szn, y = gaa)) +
geom_point(size = 2, color = '#f2d388') +
geom_text_repel(aes(label = player), size = 3) +
coord_flip() +
labs(x = "Season", y = "Goals Against Average")
w_percent_plot <-
best_win_percent_year %>%
ggplot(aes(x = szn, y = w_percent)) +
geom_point(size = 2, color = '#c98474') +
geom_text_repel(aes(label = player), size = 3) +
coord_flip() +
labs(x = "Season", y = "Win Percentage")
gsaa_plot <-
best_gsaa_year %>%
ggplot(aes(x = szn, y = gsaa)) +
geom_point(size = 2, color = '#874c62') +
geom_text_repel(aes(label = player), size = 3) +
coord_flip() +
labs(x = "Season", y = "Goals Saved Above Average")
year_plots <- (sv_plot | gaa_plot) / (w_percent_plot | gsaa_plot)
year_plots
# + plot_annotation(
# title = "Top Goalies by Season"
# )
#Worst Goalies by averages overall
worst_mean_sv_plot <- Best_Goalies %>%
top_n(-10,mean_sv_percent) %>%
ggplot(aes(x = mean_sv_percent, y = mean_aav)) +
geom_point(size = 2, color = '#a7d2cb') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Save Percentage", y = "Average Annual Salary")
worst_mean_gaa_plot <- Best_Goalies %>%
top_n(10,mean_gaa) %>%
ggplot(aes(x = mean_gaa, y = mean_aav)) +
geom_point(size = 2, color = '#f2d388') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Against Average", y = "Average Annual Salary")
worst_mean_w_percent_plot <- Best_Goalies %>%
top_n(-10, mean_w_percent) %>%
ggplot(aes(x = mean_w_percent, y = mean_aav)) +
geom_point(size = 2, color = '#c98474') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Percentage of Games Won", y = "Average Annual Salary")
worst_mean_gsaa_plot <- Best_Goalies %>%
top_n(-10, mean_gsaa) %>%
ggplot(aes(x = mean_gsaa,y = mean_aav))+
geom_point(size = 2, color = '#874c62') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Saved Above Average", y = "Average Annual Salary")
Worst_average_plots <- (worst_mean_sv_plot | worst_mean_gaa_plot) / (worst_mean_w_percent_plot | worst_mean_gsaa_plot)
Worst_average_plots
# Some really bad goalies by year
missing_salary <-
read_xls("data/MH_nhl_goalies_2017-2018.xls") %>%
clean_names() %>%
unite('player', c('first_name', 'last_name'), sep = ' ') %>%
filter(player == "Scott Darling") %>%
select(player, salary)
worst_goalies <- hr_data %>%
filter(gp > 40 & sv_percent < 0.9)
worst_salaries <- cf_data %>% select(player, aav, szn)
worst_goalies <- worst_goalies %>%
left_join(worst_salaries,by = c("player" = "player", "szn" = "szn")) %>%
left_join(missing_salary, by = c("player" = "player")) %>%
mutate(aav = replace_na(aav,0)) %>%
mutate(salary = replace_na(salary,0)) %>%
mutate(aav = aav+ salary) %>%
unite('player_szn', c('player','szn'), sep = ' | Season:', remove = FALSE) %>%
select(-salary) %>%
mutate(w_percent = w/gp)
wg_sv_percent <- worst_goalies %>%
ggplot(aes(x = sv_percent, y = aav)) +
geom_point(size = 2, color = '#a7d2cb') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Save Percentage", y = "Annual Average Salary")
wg_gsaa <- worst_goalies %>% ggplot(aes(x = gsaa, y = aav)) +
geom_point(size = 2, color = '#f2d388') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Saved Above Average", y = "Annual Average Salary")
wg_gaa <- worst_goalies %>% ggplot(aes(x = gaa, y = aav)) +
geom_point(size = 2, color = '#c98474') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Goals Against Average", y = "Annual Average Salary")
wg_w_percent <- worst_goalies %>%
ggplot(aes(x = w_percent, y = aav)) +
geom_point(size = 2, color = '#874c62') +
geom_text_repel(aes(label = player), size = 3) +
labs(x = "Win Percentage", y = "Annual Average Salary")
Worst_goalies_plot <- (wg_sv_percent | wg_gaa) / (wg_w_percent | wg_gsaa)
Worst_goalies_plot
# + plot_annotation(
# title = "Worst perfoming goalies and how much they get paid"
# )
Red point is league average.
league_avgs <-
hr_data %>%
filter(gp > 20) %>%
group_by(szn) %>%
mutate(w_percent = w/gp) %>%
summarise(
avg_sv_percent = mean(sv_percent),
avg_gaa = mean(gaa),
avg_w_percent = mean(w_percent),
avg_gsaa = mean(gsaa)
)
starting_goalies_best <-
standings_data %>%
filter(rk == 1) %>%
select(-w, -l, -ol, -pts) %>%
left_join(read_csv('data/NHLTeams.csv')) %>%
left_join(hr_data, by = c('abbrev' = 'team', 'szn' = 'szn')) %>%
group_by(team, szn) %>%
top_n(1, gp) %>%
mutate(w_percent = w/gp)
starting_goalies_sv_percent <-
ggplot() +
geom_point(data = starting_goalies_best,
aes(x = szn, y = sv_percent, color = team),
show.legend = F, size = 4) +
geom_point(data = league_avgs,
aes(x = szn, y = avg_sv_percent),
show.legend = F, size = 3, color = 'red3') +
geom_text_repel(data = starting_goalies_best,
aes(x = szn, y = sv_percent, label = player),
size = 3) +
labs(y = 'Save percent', x = 'Season') +
ggtitle('Save percent') +
scale_color_fish_d(option = "Callanthias_australis") +
coord_flip() +
theme_hc()
starting_goalies_gaa <-
ggplot() +
geom_point(data = starting_goalies_best,
aes(x = szn, y = gaa, color = team),
show.legend = F, size = 4) +
geom_point(data = league_avgs,
aes(x = szn, y = avg_gaa),
show.legend = F, size = 3, color = 'red3') +
geom_text_repel(data = starting_goalies_best,
aes(x = szn, y = gaa, label = player),
size = 3) +
labs(y = 'Goals against avg', x = NULL) +
ggtitle('Goals against average') +
scale_color_fish_d(option = "Callanthias_australis") +
coord_flip()
starting_goalies_comb <- starting_goalies_sv_percent / starting_goalies_gaa
starting_goalies_comb
Now, we will dive into goalie contracts. How much have goalies been getting paid in the last 6 seasons and how has that been changing? Have contract values been going up or down? Who gets paid the most and do they work hard for the money?
When analyzing goalie contract, most emphasis was put on analyzing the average annual value (AAV) of the contract (i.e. how much they make per year). The way contracts work in the NHL (and most professional sports with a salary cap) can be confusing. Some contracts are front-heavy (get paid the majority of the contract in the first few years), some are padded with bonuses from the signing teams, some have salary retention from previous teams. All of this is done to try and circumvent the salary cap, on a team management level. Thus, we feel it’s safe to analyze goalies using their AAV since using actual salary would be hard to understand, interpret, and every salary varies from goalie to goalie and team to team. At the end of the day, every goalie gets the full-value of their contract and its easy to understand and interepret as if they got paid an equal amount of their contract every season.
So on average how much has an NHL goaltender, who has played more than 40 NHL games, gotten paid?
aav_gp_data <-
cf_data %>%
select(player, aav, szn) %>%
left_join(
cf_data %>%
group_by(player) %>%
summarise(gp = sum(gp)),
by = 'player'
) %>%
filter(gp > 40)
aav_gp_data %>%
mutate(szn = paste0(20, szn)) %>%
ggplot(aes(szn, aav, color = szn, fill = szn)) +
geom_violin(alpha = 0.8,
show.legend = F, trim = T, width = 0.9) +
geom_boxplot(width = 0.1, color = 'grey90', fill = 'grey50', alpha = 0.1) +
labs(x = NULL, y = 'Average annual value ($)') +
#ggtitle('Distribution of goalie average annual value') +
scale_y_continuous(labels = scales::dollar) +
fishualize::scale_color_fish_d(option = 'Scarus_globiceps') +
fishualize::scale_fill_fish_d(option = 'Scarus_globiceps') +
theme_classic() +
theme(axis.line = element_blank(),
axis.text.x = element_text(size = 12, color = 'grey25'),
axis.text.y = element_text(size = 7, color = 'grey25'),
axis.title.y = element_text(size = 9, color = 'grey25', face = 'italic'),
plot.title = element_text(hjust = 0.5, size = 16, color = 'grey15', face= 'italic'))
Another way goalies are weird is that goalies in the past have not usually made as much as their skater teammates. This is starting to change however. For example, Carey Price signed a contract extension that took effect in the 18-19 season for $10.5M per year over 8 years, $84M in total by the time the contract is up (see above plot). This was an unprecedented signing at the time, but as teams start to value good, consistent goalies more and more, we are likely to see more high-valued goalie contracts. We can see from the plot that the median AAV has shifted in the past two seasons from $2M to closer to $3M. Also the AAV distribution is more spread out in the last couple seasons, instead of being clumped all near the lower end.
cf_data %>%
group_by(exp_year) %>%
summarise(med_aav = median(aav)) %>%
ggplot(aes(exp_year, med_aav)) +
geom_step(color = 'turquoise3', size = 1) +
labs(x = 'Contract expiry year', y = 'Median average annual value of contract') +
scale_y_continuous(labels = scales::dollar) +
theme_classic() +
theme(axis.line = element_blank())
## Get top 10 paid players in league over last 7 years
top_10_paid <-
cf_data %>%
group_by(player) %>%
top_n(1, aav) %>%
ungroup() %>%
distinct(player, aav, team) %>%
top_n(10, aav)
## Plot top 10 paid players
top_10_paid %>%
mutate(team = substr(team, start = str_length(team) - 2, stop = str_length(team))) %>%
mutate(player = paste0(player, '\n(', team, ') ')) %>%
ggplot() +
geom_segment(aes(x = reorder(player, aav), xend = player, y = 0, yend = aav), color = 'grey50') +
geom_point(aes(x = player, y = aav, color = team), size = 5, show.legend = F) +
labs(x = NULL, y = 'Average annual value') +
#ggtitle('Top 10 paid players in the league (as of 2018-19)') +
coord_flip() +
scale_color_fish_d(option = 'Callanthias_australis') +
scale_y_continuous(labels = scales::dollar) +
theme_classic() +
theme(axis.line = element_blank(),
axis.text.y = element_text(size = 10, color = 'grey25'),
plot.title = element_text(hjust = 0.5, size = 16, color = 'grey25', face= 'italic'))
## Prep and combine data
cf_reg_data <-
cf_data %>%
mutate(
## Removes accents off names
player = stri_trans_general(player, 'latin-ascii'),
## Transform all names to lowercase
player = str_to_lower(player)) %>%
## Remove redundant cols
select(-weight, -height, -pos, -team, -age, -gp, -w, -l, -so, -gaa, -sv_percent)
hr_reg_data <-
hr_data %>%
## Transform all names to lowercase
mutate(player = str_to_lower(player)) %>%
# Filter for post lockout szns
filter(szn != '12-13') %>%
# Fix nicknames
mutate(player = case_when(
str_detect(player, 'cal heeter') ~ str_replace(player, 'cal heeter', 'calvin heeter'),
str_detect(player, 'matt o\'connor') ~ str_replace(player, 'matt o\'connor', 'matthew o\'connor'),
str_detect(player, 'eddie pasquale') ~ str_replace(player, 'eddie pasquale', 'edward pasquale'),
TRUE ~ as.character(player)
))
mp_reg_data <-
mp_data %>%
# Filter data for all situations (5v5, player-down, player-up)
filter(situation == 'all', szn != '12-13') %>%
# Remove redundant cols
select(-player_id, -team, -season, -position, -games_played,
-penality_minutes, -penalties, -situation, -goals) %>%
# Transform all names to lowercase
mutate(name = str_to_lower(name)) %>%
# Fix nicknames
mutate(name = case_when(
str_detect(name, 'tom mccollum') ~ str_replace(name, 'tom mccollum', 'thomas mccollum'),
str_detect(name, 'j.f. berube') ~ str_replace(name, 'j.f. berube', 'jean-francois berube'),
str_detect(name, 'j-f berube') ~ str_replace(name, 'j-f berube', 'jean-francois berube'),
str_detect(name, 'cal petersen') ~ str_replace(name, 'cal petersen', 'calvin petersen'),
TRUE ~ as.character(name)
))
# Join all data together
joined_data <-
hr_reg_data %>%
full_join(mp_reg_data, by = c('player' = 'name', 'szn' = 'szn')) %>%
inner_join(cf_reg_data, by = c('player', 'szn')) %>%
# Remove extra contract info (since we don't "know" this information yet)
select(-rk, -cap_hit_percent, -salary, -length, -cap_hit)
reg_data <-
joined_data %>%
# Code binary variables
mutate(handed = ifelse(handed == 'Left', 0, 1),
expiry = ifelse(expiry == 'UFA', 0, 1)) %>%
# Select numeric predictors
select_if(is.numeric)
reg_data %>% ncol()
## [1] 60
Calculate information gain for predictors and grab 10 highest predictors in information gain
| play_continued_in_zone | x_on_goal | unblocked_shot_attempts | x_freeze | min | icetime | low_danger_shots | high_dangerx_goals | gs | play_continued_outside_zone | aav |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 20.96 | 38 | 4.45 | 40 | 2400 | 17 | 0.68 | 1 | 11 | 900000 |
| 376 | 787.82 | 1498 | 179.44 | 1569 | 94166 | 870 | 19.83 | 24 | 397 | 1775000 |
| 726 | 1639.26 | 2844 | 383.84 | 3000 | 179926 | 1789 | 48.03 | 52 | 868 | 3187500 |
| 41 | 74.95 | 145 | 18.06 | 139 | 8023 | 79 | 1.78 | 3 | 23 | 625000 |
| 241 | 533.23 | 976 | 124.78 | 1094 | 65647 | 574 | 12.52 | 19 | 283 | 3416666 |
| 844 | 1820.57 | 3330 | 423.74 | 3084 | 184714 | 1964 | 42.03 | 49 | 879 | 2900000 |
Build regression model with selected features
## Model
mod <- lm(aav~., feat_selected)
summary(mod)
##
## Call:
## lm(formula = aav ~ ., data = feat_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3810987 -779564 -36764 584496 5612561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686544.59 110073.72 6.237 9.13e-10 ***
## play_continued_in_zone 1228.14 2697.81 0.455 0.6491
## x_on_goal 2156.58 5600.18 0.385 0.7003
## unblocked_shot_attempts -1198.09 1224.44 -0.978 0.3283
## x_freeze 20710.43 20921.61 0.990 0.3227
## min -2924.02 5158.02 -0.567 0.5710
## icetime -22.11 84.85 -0.261 0.7945
## low_danger_shots -5424.90 2577.60 -2.105 0.0358 *
## high_dangerx_goals -4782.75 19989.19 -0.239 0.8110
## gs 351068.82 49849.96 7.043 5.90e-12 ***
## play_continued_outside_zone -806.60 2853.88 -0.283 0.7776
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1425000 on 529 degrees of freedom
## Multiple R-squared: 0.5562, Adjusted R-squared: 0.5478
## F-statistic: 66.29 on 10 and 529 DF, p-value: < 2.2e-16
new_data1 <-
joined_data %>%
filter(player == 'sergei bobrovsky', szn == '18-19') %>%
select(subset_info_gain)
predict(mod, new_data1)
## 1
## 5294396
new_data2 <-
joined_data %>%
filter(player == 'mikko koskinen') %>%
select(subset_info_gain)
predict(mod, new_data2)
## 1
## 4391092